A Multi-Armed Bandit Approach for Online Expert Selection in Markov Decision Processes

Authors

  • Eric Mazumdar
  • Roy Dong
  • Vicenç Rúbies Royo
  • Claire J. Tomlin
  • S. Shankar Sastry
Abstract

We formulate a multi-armed bandit (MAB) approach to choosing expert policies online in Markov decision processes (MDPs). Given a set of expert policies trained on a state and action space, the goal is to maximize the cumulative reward of our agent. The hope is to quickly find the best expert in our set. The MAB formulation allows us to quantify the performance of an algorithm in terms of the regret incurred from not choosing the best expert from the beginning. We first develop the theoretical framework for MABs in MDPs, and then present a basic regret decomposition identity. We then adapt the classical Upper Confidence Bounds algorithm to the problem of choosing experts in MDPs and prove that the expected regret grows at worst at a logarithmic rate. Lastly, we validate the theory on a small MDP.

arXiv:1707.05714v1 [cs.SY] 18 Jul 2017
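
To make the setup concrete, below is a minimal sketch of how a UCB-style rule could pick among a fixed set of expert policies episode by episode. The environment interface (reset/step), the use of the episodic return as the bandit reward, and the exploration constant c are illustrative assumptions, not the paper's exact algorithm.

```python
import math


def ucb_expert_selection(env, experts, num_episodes, horizon, c=2.0):
    """UCB1-style selection of an expert policy for each episode.

    Assumed interface (not from the paper): env.reset() -> state,
    env.step(action) -> (state, reward, done); each expert maps a
    state to an action. The episodic return is used as the bandit
    reward for the chosen expert ("arm").
    """
    n = len(experts)
    counts = [0] * n          # episodes played by each expert
    mean_return = [0.0] * n   # empirical mean episodic return

    for t in range(num_episodes):
        if t < n:
            k = t  # play every expert once before using confidence bounds
        else:
            # pick the expert with the largest upper confidence bound
            k = max(
                range(n),
                key=lambda i: mean_return[i]
                + math.sqrt(c * math.log(t + 1) / counts[i]),
            )

        # roll out one episode with the chosen expert
        state, episode_return = env.reset(), 0.0
        for _ in range(horizon):
            state, reward, done = env.step(experts[k](state))
            episode_return += reward
            if done:
                break

        # incremental update of the chosen expert's mean return
        counts[k] += 1
        mean_return[k] += (episode_return - mean_return[k]) / counts[k]

    return mean_return, counts
```

Treating each expert as one arm in this way is what lets standard UCB-style regret arguments transfer to the expert-selection setting, provided the episodic returns are bounded so that the confidence radius is meaningful.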

Similar articles

Online Regret Bounds for Markov Decision Processes with Deterministic Transitions

We consider an upper confidence bound algorithm for Markov decision processes (MDPs) with deterministic transitions. For this algorithm we derive upper bounds on the online regret (with respect to an (ε-)optimal policy) that are logarithmic in the number of steps taken. These bounds also match known asymptotic bounds for the general MDP setting. We also present corresponding lower bounds. As an...


One-armed bandit models with continuous and delayed responses

One-armed bandit processes with continuous delayed responses are formulated as controlled stochastic processes following the Bayesian approach. It is shown that under some regularity conditions, a Gittins-like index exists which is the limit of a monotonic sequence of break-even values characterizing optimal initial selections of arms for finite horizon bandit processes. Furthermore, there is a...


Multi-armed Bandit Online Learning Based on POMDP in Cognitive Radio

In cognitive radio, most existing research efforts devoted to spectrum sharing have two weaknesses. First, they are largely formulated as a Markov decision process (MDP), which requires complete knowledge of the channel. Second, most of the studies rely on online learning based only on the perceived channel. To address these problems, a new algorithm is proposed in this paper: if the authorized ...


PAC Bounds for Multi-armed Bandit and Markov Decision Processes

The bandit problem is revisited and considered under the PAC model. Our main contribution in this part is to show that given n arms, it suffices to pull the arms O((n/ε²) log(1/δ)) times to find an ε-optimal arm with probability at least 1 − δ. This is in contrast to the naive bound of O((n/ε²) log(n/δ)). We derive another algorithm whose complexity depends on the specific setting of the rewards,...

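As a rough numerical illustration of the gap between the two sample-complexity bounds mentioned above, the sketch below compares total pull counts using Hoeffding-style constants; the constants and the factors inside the logarithms are illustrative assumptions, not the cited paper's exact expressions.

```python
import math


def pac_total_pulls(n_arms, eps, delta):
    """Compare illustrative total pull counts for finding an eps-optimal arm.

    Naive strategy: estimate every arm to accuracy eps/2 and union-bound
    over all n arms, i.e. roughly (2/eps^2) * log(2n/delta) pulls per arm.
    The improved bound removes the log(n) factor, giving a total of
    O((n/eps^2) * log(1/delta)) pulls. Constants here are illustrative.
    """
    per_arm_naive = math.ceil((2.0 / eps**2) * math.log(2.0 * n_arms / delta))
    total_naive = n_arms * per_arm_naive
    total_improved = math.ceil((n_arms / eps**2) * math.log(1.0 / delta))
    return total_naive, total_improved


# Example: with 100 arms, eps = 0.1 and delta = 0.05, the union bound over
# arms in the naive strategy inflates the total pull count noticeably.
print(pac_total_pulls(n_arms=100, eps=0.1, delta=0.05))
```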

Q-Learning for Bandit Problems

Multi-armed bandits may be viewed as decompositionally-structured Markov decision processes (MDPs) with potentially very large state sets. A particularly elegant methodology for computing optimal policies was developed over twenty years ago by Gittins [Gittins & Jones, 1974]. Gittins' approach reduces the problem of finding optimal policies for the original MDP to a sequence of low-dimensional stopping...



Journal:
  • CoRR

Volume: abs/1707.05714  Issue:

Pages: -

Publication date: 2017